Residual Tokens Enhance Masked Autoencoders For Speech Modeling

Authors
Affiliation

Samir Sadok

Inria at Univ. Grenoble Alpes, CNRS, LJK, France

Stéphane Lathuilière

Xavier Alameda-Pineda

Abstract

Recent speech modeling relies on explicit attributes such as pitch, content, and speaker identity, but these alone cannot capture the full richness of natural speech. We introduce RT-MAE, a novel masked autoencoder framework that augments supervised attribute-based modeling with unsupervised, trainable residual tokens designed to encode the information not explained by explicit labeled factors (e.g., timbre variations, noise, emotion, etc.). Experiments show that RT-MAE improves reconstruction quality, preserving content and speaker similarity while enhancing expressivity. We further demonstrate its applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.

Summary

  • Explicit / Residual separation
    Speech generation is decomposed into:
    • explicit attributes (e.g., pitch, linguistic content, speaker identity) for interpretable control,
    • continuous residual tokens capturing phenomena not modeled by attributes.
  • Residual tokens via cross-attention
    • Inspired by the Perceiver, a fixed set of learnable queries extracts residual information from the spectrogram.
    • Provides a compact, controlled representation whose size is independent of the input sequence length.
  • Complementarity and flexibility
    • Residual tokens enrich explicit attributes, enabling speech that is both more natural and controllable.
  • Dropout-based regularization
    • Targeted dropout on residual tokens prevents the model from over-relying on them.
    • Enforces effective use of explicit attributes, ensuring interpretability and controllability.
  • Unified MAE-based architecture
    • Built upon the MAE paradigm: discrete tokens for spectrograms and attributes, continuous tokens for residuals, with Transformer encoding and HiFi-GAN decoding.
    • Partial masking during training encourages intra- and inter-modal dependency learning.
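
The partial masking mentioned in the last point can be illustrated with a minimal sketch. It is not the authors' implementation: the `MASK_ID` value, the per-modality mask ratios, the modality names, and the helper names `mask_tokens` and `mask_modalities` are all assumptions made for illustration. Each discrete token sequence has a random subset of its positions replaced by a mask token, so that a model would have to recover them from the remaining tokens of the same modality and from the other modalities.

```python
import numpy as np

MASK_ID = -1  # hypothetical id reserved for the mask token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random fraction of a 1-D discrete token sequence with MASK_ID.

    Returns the masked copy and a boolean array marking the masked positions.
    """
    tokens = np.asarray(tokens)
    n_mask = int(round(mask_ratio * tokens.size))
    idx = rng.choice(tokens.size, size=n_mask, replace=False)
    masked = tokens.copy()
    masked[idx] = MASK_ID
    is_masked = np.zeros(tokens.size, dtype=bool)
    is_masked[idx] = True
    return masked, is_masked


def mask_modalities(sequences, mask_ratios, rng):
    """Mask each modality independently with its own (assumed) ratio."""
    return {
        name: mask_tokens(seq, mask_ratios[name], rng)
        for name, seq in sequences.items()
    }


# Usage with made-up token ids for two modalities.
rng = np.random.default_rng(0)
sequences = {"spectrogram": np.arange(100), "pitch": np.arange(100, 200)}
masked = mask_modalities(sequences, {"spectrogram": 0.5, "pitch": 0.25}, rng)
```

In a training loop, the boolean arrays would typically select the positions on which a reconstruction loss is computed; that part is omitted here.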

👉 RT-MAE introduces continuous residual tokens extracted via cross-attention and regularized by dropout, combining controllability from explicit attributes with flexibility from residuals within an MAE framework for speech generation.
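
As a rough illustration of the two mechanisms named above, the sketch below shows Perceiver-style cross-attention with a fixed set of queries, followed by dropout applied to the resulting residual tokens. It is a simplified NumPy sketch under stated assumptions, not the RT-MAE implementation: it uses a single attention head, randomly initialized weights standing in for learned parameters, per-token dropout that zeroes dropped tokens, and hypothetical names and sizes (`ResidualTokenExtractor`, `residual_token_dropout`, 8 tokens of dimension 16, 80 spectrogram bins).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ResidualTokenExtractor:
    """Single-head cross-attention with a fixed number of queries (illustrative)."""

    def __init__(self, num_tokens=8, dim=16, feat_dim=80, seed=0):
        rng = np.random.default_rng(seed)
        # In a real model these would be trainable parameters.
        self.queries = rng.normal(0.0, 0.02, (num_tokens, dim))
        self.w_k = rng.normal(0.0, feat_dim ** -0.5, (feat_dim, dim))
        self.w_v = rng.normal(0.0, feat_dim ** -0.5, (feat_dim, dim))
        self.dim = dim

    def __call__(self, spectrogram):
        # spectrogram: (frames, feat_dim), any number of frames
        keys = spectrogram @ self.w_k                      # (frames, dim)
        values = spectrogram @ self.w_v                    # (frames, dim)
        scores = self.queries @ keys.T / np.sqrt(self.dim)
        attn = softmax(scores, axis=-1)                    # (num_tokens, frames)
        return attn @ values                               # (num_tokens, dim)


def residual_token_dropout(tokens, drop_prob, rng, training=True):
    """Zero out each residual token independently with probability drop_prob.

    Assumption: dropped tokens are replaced by zeros; other choices are possible.
    """
    if not training:
        return tokens
    keep = rng.random(tokens.shape[0]) >= drop_prob
    return tokens * keep[:, None]


# Usage: inputs of different lengths yield the same number of residual tokens.
rng = np.random.default_rng(0)
extractor = ResidualTokenExtractor()
short_tokens = extractor(rng.normal(size=(50, 80)))
long_tokens = extractor(rng.normal(size=(300, 80)))
print(short_tokens.shape, long_tokens.shape)  # (8, 16) (8, 16)
regularized = residual_token_dropout(long_tokens, drop_prob=0.5, rng=rng)
```

Because the number of queries is fixed, the output size does not depend on the number of spectrogram frames, which is the property the summary attributes to the residual tokens.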

Figure 1: Illustration of the RT-MAE architecture with residual tokens.

Article (.pdf)

Citation